 amino-acid sequence


Protein as a Second Language for LLMs

Chen, Xinhui, Li, Zuchao, Gao, Mengqi, Zhang, Yufeng, Leong, Chak Tou, Li, Haoyang, Chen, Jiaqi

arXiv.org Artificial Intelligence

Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
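As a rough illustration of the idea in this abstract, the sketch below assembles sequence-question-answer exemplars into a prompt for a general-purpose LLM. The abstract does not say how exemplars are chosen, so the k-mer Jaccard ranking is an assumption made here for illustration. The names `Exemplar`, `kmers`, `jaccard`, and `build_prompt` are hypothetical, and the toy sequences and answers are placeholders, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One sequence-question-answer triple used as in-context evidence."""
    sequence: str
    question: str
    answer: str


def kmers(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two sequences' k-mer sets (assumed ranking metric)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def build_prompt(query_seq: str, question: str, pool: list, n: int = 2) -> str:
    """Pick the n exemplars most similar to the query and format a prompt."""
    ranked = sorted(pool, key=lambda e: jaccard(query_seq, e.sequence), reverse=True)
    blocks = [
        f"Sequence: {e.sequence}\nQuestion: {e.question}\nAnswer: {e.answer}"
        for e in ranked[:n]
    ]
    # The query goes last, with the answer left open for the model to complete.
    blocks.append(f"Sequence: {query_seq}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Toy placeholder pool (invented strings, for illustration only).
pool = [
    Exemplar("MKTAYIAKQR", "What is the function?", "placeholder answer A"),
    Exemplar("GGGGSGGGGS", "What is the function?", "placeholder answer B"),
    Exemplar("MKTAYIAKQL", "What is the function?", "placeholder answer C"),
]
prompt = build_prompt("MKTAYIAKQRQ", "What is the function?", pool, n=2)
```

The resulting string can be sent to any LLM without further training, which mirrors the zero-shot setting the abstract describes. The selection strategy is the part most likely to differ from the authors' method.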


AlphaFold Spreads through Protein Science

Communications of the ACM

Two years ago, as the COVID-19 pandemic swept across the world, researchers at DeepMind, the artificial intelligence (AI) research laboratory owned by Alphabet Inc., demonstrated how they could use machine learning to achieve a breakthrough in predicting how proteins, the workhorses of the living cell, fold into the intricate shapes they take on. The work gave biologists hope that they could use this kind of tool to tackle diseases such as that caused by the SARS-CoV-2 coronavirus much more quickly in the future. Researchers were able to assess the abilities of DeepMind's AlphaFold2 thanks to its inclusion in the 14th Critical Assessment of Structure Prediction (CASP14), a benchmarking competition that ran through 2020 and added a parallel program to uncover the structures of key proteins from the SARS-CoV-2 virus in an effort to accelerate vaccine and drug development. The organizers of CASP14 declared that the tool represented "an almost complete solution to the problem of computing three-dimensional structure from amino-acid sequences," though some caveats lie behind that statement. In principle, quantum mechanical simulations can predict which collection of folds leads to the lowest combined energy of all the chemical bonds in the shape and the water and other molecules around it.


Physics - Machine-Learning Model Reveals Protein-Folding Physics

#artificialintelligence

Proteins control every cell-level aspect of life, from immunity to brain activity. They are encoded by long sequences of compounds called amino acids that fold into large, complex 3D structures. Computational algorithms can model the physical amino-acid interactions that drive this folding [1]. But determining the resulting protein structures has remained challenging. In a recent breakthrough, a machine-learning model called AlphaFold [2] predicted the 3D structure of proteins from their amino-acid sequences.


Macromolecule Classification Based on the Amino-acid Sequence

Ghaffar, Faisal, Khan, Sarwar, O., Gaddisa, Yu-jhen, Chen

arXiv.org Artificial Intelligence

Deep learning plays a vital role in every field that involves data. It has emerged as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems that were difficult to solve with traditional machine learning techniques. In this study we focus on classifying protein sequences with deep learning techniques. The study of amino-acid sequences is vital in the life sciences. We used different word embedding techniques from natural language processing to represent the amino-acid sequence as vectors. Our main goal was to classify sequences into four classes: DNA, RNA, protein, and hybrid. After several tests we achieved almost 99% training and test accuracy. We experimented with CNN, LSTM, bidirectional LSTM, and GRU models.
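To make the "sequence as words" step concrete, here is a minimal sketch of one common way to prepare a sequence for a word-embedding layer: split it into overlapping k-mers, build a vocabulary, and encode each sequence as a fixed-length list of integer ids. The abstract does not specify the tokenization or embedding method, so the k-mer scheme, the padding and unknown tokens, and the function names `tokenize`, `build_vocab`, and `encode` are all assumptions for illustration.

```python
def tokenize(seq: str, k: int = 3) -> list:
    """Split a sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(seqs: list, k: int = 3) -> dict:
    """Assign an integer id to every k-mer seen in the training sequences."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids for padding and unseen k-mers
    for s in seqs:
        for token in tokenize(s, k):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(seq: str, vocab: dict, k: int = 3, max_len: int = 8) -> list:
    """Map a sequence to a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize(seq, k)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Toy example: an invented 4-letter string, not a sequence from the paper.
vocab = build_vocab(["ACGU"])
ids = encode("ACGUA", vocab, max_len=4)
```

The integer ids would then index into an embedding matrix, whose rows are the vectors fed to a CNN, LSTM, or GRU classifier.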


Is artificial intelligence deserving of all the hype?

#artificialintelligence

Artificial intelligence is moving into all areas of engineering, science, business and industry; indeed, AI is now the dominant approach, pushing others to the background. Recently, DeepMind, owned by Google, demonstrated an algorithm called AlphaFold to predict the three-dimensional structure of a protein from its amino-acid sequence. This is a fundamental problem in biology. Laboratory methods are laborious and therefore progress has been slow. AlphaFold would make the process very fast and thereby greatly accelerate important applications such as discovering new drugs.


DeepMind Makes History Again By Solving a 50-Year-Old Problem In Biology

#artificialintelligence

You may have heard about "DeepMind" in the past, and if you haven't, now you will. DeepMind has racked up a number of achievements since it was founded, but it is best known for AlphaGo, an AI program that beat some of the best professional Go players in history, including Ke Jie. DeepMind's AlphaFold 2 can now identify a protein's three-dimensional structure from its amino-acid sequence to within the width of an atom. To give some context, AlphaFold 2 competed with over 100 research groups worldwide in a competition known as the Critical Assessment of Protein Structure Prediction, or CASP. The goal was exactly what AlphaFold 2 achieved: to predict a protein's structure from its amino-acid sequence.